9 research outputs found
The Journey is the Reward: Unsupervised Learning of Influential Trajectories
Unsupervised exploration and representation learning become increasingly
important when learning in diverse and sparse environments. The
information-theoretic principle of empowerment formalizes an unsupervised
exploration objective through an agent trying to maximize its influence on the
future states of its environment. Previous approaches carry certain limitations
in that they either do not employ closed-loop feedback or do not have an
internal state. As a consequence, a privileged final state is taken as an
influence measure, rather than the full trajectory. We provide a model-free
method which takes into account the whole trajectory while still offering the
benefits of option-based approaches. We successfully apply our approach to
settings with large action spaces, where discovery of meaningful action
sequences is particularly difficult. Comment: ICML'19 ERL Workshop
Multimodal Transitions for Generative Stochastic Networks
Generative Stochastic Networks (GSNs) have been recently introduced as an
alternative to traditional probabilistic modeling: instead of parametrizing the
data distribution directly, one parametrizes a transition operator for a Markov
chain whose stationary distribution is an estimator of the data generating
distribution. The result of training is therefore a machine that generates
samples through this Markov chain. However, the previously introduced GSN
consistency theorems suggest that in order to capture a wide class of
distributions, the transition operator in general should be multimodal,
something that has not been done before this paper. We introduce for the first
time multimodal transition distributions for GSNs, in particular using models
in the NADE family (Neural Autoregressive Density Estimator) as output
distributions of the transition operator. A NADE model is related to an RBM
(and can thus model multimodal distributions) but its likelihood (and
likelihood gradient) can be computed easily. The parameters of the NADE are
obtained as a learned function of the previous state of the learned Markov
chain. Experiments clearly illustrate the advantage of such multimodal
transition distributions over unimodal GSNs. Comment: 7 figures, 9 pages, submitted to ICLR14
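To make the sampling procedure concrete, here is a minimal sketch (not the authors' code) of drawing from a GSN-style learned Markov chain: corrupt the current state, then sample the next state from the transition operator's output distribution. For brevity the output here is a factorized Bernoulli; the point of the paper is that a multimodal output model such as a NADE should replace it. The network sizes, corruption process, and chain length are all placeholder assumptions.

```python
import torch
import torch.nn as nn

# Transition operator: maps a corrupted state to the parameters of the
# next-state distribution. A factorized Bernoulli output is used here only
# for brevity; the paper argues for a multimodal output such as a NADE.
transition = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 784))

def gsn_sample(n_steps=100):
    x = torch.bernoulli(torch.full((1, 784), 0.5))                  # arbitrary initial state
    for _ in range(n_steps):
        x_corrupted = x * torch.bernoulli(torch.full_like(x, 0.9))  # simple masking noise
        probs = torch.sigmoid(transition(x_corrupted))              # parameters of P(next state | corrupted state)
        x = torch.bernoulli(probs)                                  # one step of the Markov chain
    return x

sample = gsn_sample()   # after many steps, x approximates a draw from the stationary distribution
```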
Deep Directed Generative Autoencoders
For discrete data, the likelihood can be rewritten exactly and
parametrized into $P(X=x) = P(X=x \mid H=f(x))\,P(H=f(x))$ if
$P(X=x \mid H=f(x))$ has enough capacity to put no probability mass on any
$x'$ for which $f(x') \neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the
first factor gives rise to the log-likelihood reconstruction error of an
autoencoder with $f(\cdot)$ as the encoder and $P(X \mid H)$ as the (probabilistic)
decoder. The log of the second term can be seen as a regularizer on the encoded
activations $h = f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder
can be represented by a deep neural network and trained to maximize the average
of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder
$f(\cdot)$ that maps $X$ to $f(X)$, which has a much simpler distribution than $X$
itself, estimated by $P(H)$. This "flattens the manifold" or concentrates
probability mass in a smaller number of (relevant) dimensions over which the
distribution factorizes. Generating samples from the model is straightforward
using ancestral sampling. One challenge is that regular back-propagation cannot
be used to obtain the gradient on the parameters of the encoder, but we find
that using the straight-through estimator works well here. We also find that
although optimizing a single level of such architecture may be difficult, much
better results can be obtained by pre-training and stacking them, gradually
transforming the data distribution into one that is more easily captured by a
simple parametric model.
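The straight-through estimator mentioned above can be illustrated with a small sketch, assuming a binarizing encoder and a Bernoulli decoder; this is not the authors' implementation, and the layer sizes and toy data are placeholders. The hard threshold is applied in the forward pass, while its gradient is replaced by the identity in the backward pass, so the reconstruction term log P(X = x | H = f(x)) can still be optimized by back-propagation.

```python
import torch
import torch.nn as nn

class StraightThroughBinarize(torch.autograd.Function):
    """Hard threshold in the forward pass, identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, logits):
        return (logits > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass the gradient as if the threshold were the identity.
        return grad_output

class DiscreteAutoencoder(nn.Module):
    def __init__(self, n_in=784, n_hidden=200):
        super().__init__()
        self.encoder = nn.Linear(n_in, n_hidden)   # f(x): deterministic, discrete after thresholding
        self.decoder = nn.Linear(n_hidden, n_in)   # P(X | H): Bernoulli logits

    def forward(self, x):
        h = StraightThroughBinarize.apply(self.encoder(x))   # h = f(x) in {0, 1}^n_hidden
        return self.decoder(h), h

# Reconstruction term log P(X = x | H = f(x)); the regularizer on P(H) would be added separately.
model = DiscreteAutoencoder()
x = torch.rand(32, 784).round()                   # toy binary data
logits, h = model(x)
recon_nll = nn.functional.binary_cross_entropy_with_logits(logits, x)
recon_nll.backward()                              # gradients reach the encoder via the straight-through estimator
```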
On Variational Bounds of Mutual Information
Estimating and optimizing Mutual Information (MI) is core to many problems in
machine learning; however, bounding MI in high dimensions is challenging. To
establish tractable and scalable objectives, recent work has turned to
variational bounds parameterized by neural networks, but the relationships and
tradeoffs between these bounds remain unclear. In this work, we unify these
recent developments in a single framework. We find that the existing
variational lower bounds degrade when the MI is large, exhibiting either high
bias or high variance. To address this problem, we introduce a continuum of
lower bounds that encompasses previous bounds and flexibly trades off bias and
variance. On high-dimensional, controlled problems, we empirically characterize
the bias and variance of the bounds and their gradients and demonstrate the
effectiveness of our new bounds for estimation and representation learning. Comment: ICML 2019
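As one concrete member of this family of variational lower bounds, the sketch below computes an InfoNCE-style bound from a batch of critic scores; the separable critic, dimensions, and toy data are illustrative assumptions rather than the paper's experimental setup. Note that this particular bound saturates at log(batch size), one instance of the bias/variance trade-off discussed above.

```python
import torch
import torch.nn as nn

def infonce_lower_bound(scores):
    """InfoNCE bound on I(X; Y) from a [batch, batch] matrix of critic scores f(x_i, y_j).

    The bound is E[ f(x_i, y_i) - log mean_j exp f(x_i, y_j) ], which cannot
    exceed log(batch), hence low variance but high bias when MI is large.
    """
    batch = scores.shape[0]
    positives = scores.diag()
    log_mean_exp = torch.logsumexp(scores, dim=1) - torch.log(torch.tensor(float(batch)))
    return (positives - log_mean_exp).mean()

# Separable critic f(x, y) = g(x)^T h(y); sizes are placeholders.
g = nn.Linear(10, 32)
h = nn.Linear(10, 32)
x = torch.randn(128, 10)
y = x + 0.1 * torch.randn(128, 10)          # correlated pair, so MI > 0
scores = g(x) @ h(y).t()                    # scores[i, j] = f(x_i, y_j)
mi_estimate = infonce_lower_bound(scores)
```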
SketchTransfer: A Challenging New Task for Exploring Detail-Invariance and the Abstractions Learned by Deep Networks
Deep networks have achieved excellent results in perceptual tasks, yet their
ability to generalize to variations not seen during training has come under
increasing scrutiny. In this work we focus on their ability to have invariance
towards the presence or absence of details. For example, humans are able to
watch cartoons, which are missing many visual details, without being explicitly
trained to do so. As another example, 3D rendering software is a relatively
recent development, yet people are able to understand such rendered scenes even
though they are missing details (consider a film like Toy Story). The failure
of machine learning algorithms to do this indicates a significant gap in
generalization between human abilities and the abilities of deep networks. We
propose a dataset that will make it easier to study the detail-invariance
problem concretely. We produce a concrete task for this: SketchTransfer, and we
show that state-of-the-art domain transfer algorithms still struggle with this
task. The state-of-the-art technique which achieves over 95% on MNIST → SVHN
transfer only achieves 59% accuracy on the
SketchTransfer task, which is much better than random (11% accuracy) but falls
short of the 87% accuracy of a classifier trained directly on labeled
sketches. This indicates that this task is approachable with today's best
methods but has substantial room for improvement. Comment: Accepted WACV 2020
Maximum Entropy Generators for Energy-Based Models
Maximum likelihood estimation of energy-based models is a challenging problem
due to the intractability of the log-likelihood gradient. In this work, we
propose learning both the energy function and an amortized approximate sampling
mechanism using a neural generator network, which provides an efficient
approximation of the log-likelihood gradient. The resulting objective requires
maximizing entropy of the generated samples, which we perform using recently
proposed nonparametric mutual information estimators. Finally, to stabilize the
resulting adversarial game, we use a zero-centered gradient penalty derived as
a necessary condition from the score matching literature. The proposed
technique can generate sharp images with Inception and FID scores competitive
with recent GAN techniques, does not suffer from mode collapse, and is
competitive with state-of-the-art anomaly detection techniques.
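A heavily simplified sketch of the two training signals described above is given below; it is an illustration under stated assumptions, not the authors' code. The generator's entropy term is stood in for by a generic neural mutual-information estimate mi_estimate(z, G(z)) (for example a MINE-style bound, as in the next entry), and all modules, dimensions, and weights are placeholders.

```python
import torch
import torch.nn as nn

# Heavily simplified sketch of the two objectives, not the authors' implementation.
E = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))   # energy function E(x)
G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))   # amortized sampler G(z)

def energy_step(x_real, x_fake, gp_weight=1.0):
    x_real = x_real.clone().requires_grad_(True)
    e_real = E(x_real)
    # Approximate maximum-likelihood update: lower the energy of data, raise it for samples.
    loss = e_real.mean() - E(x_fake.detach()).mean()
    # Zero-centered gradient penalty on data points, used to stabilize the adversarial game.
    grad = torch.autograd.grad(e_real.sum(), x_real, create_graph=True)[0]
    return loss + gp_weight * grad.pow(2).sum(dim=1).mean()

def generator_step(z, x_fake, mi_estimate):
    # Push samples toward low energy while keeping them diverse: the entropy of G
    # is encouraged through a neural estimate of I(z; G(z)); mi_estimate is a stand-in.
    return E(x_fake).mean() - mi_estimate(z, x_fake)

z = torch.randn(64, 8)
x_real = torch.randn(64, 2)                 # toy "data"
e_loss = energy_step(x_real, G(z))          # would be followed by an optimizer step on E
```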
MINE: Mutual Information Neural Estimation
We argue that the estimation of mutual information between high dimensional
continuous random variables can be achieved by gradient descent over neural
networks. We present a Mutual Information Neural Estimator (MINE) that is
linearly scalable in dimensionality as well as in sample size, trainable
through back-prop, and strongly consistent. We present a handful of
applications on which MINE can be used to minimize or maximize mutual
information. We apply MINE to improve adversarially trained generative models.
We also use MINE to implement Information Bottleneck, applying it to supervised
classification; our results demonstrate substantial improvement in flexibility
and performance in these settings. Comment: 19 pages, 6 figures
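The abstract does not spell out the estimator; the sketch below assumes the Donsker-Varadhan representation of the KL divergence, which is the lower bound MINE is built on, and omits the paper's bias-corrected gradient. The statistics network, dimensions, and toy data are placeholders.

```python
import torch
import torch.nn as nn

# Donsker-Varadhan bound: I(X; Y) >= E_p(x,y)[T(x, y)] - log E_p(x)p(y)[exp T(x, y)],
# with T a small "statistics network" trained by gradient ascent on the bound.
class StatisticsNetwork(nn.Module):
    def __init__(self, dim_x, dim_y, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1)).squeeze(1)

def dv_lower_bound(T, x, y):
    joint = T(x, y).mean()                                # samples from p(x, y)
    y_shuffled = y[torch.randperm(y.shape[0])]            # approximate samples from p(x)p(y)
    marginal = torch.logsumexp(T(x, y_shuffled), dim=0) - torch.log(torch.tensor(float(x.shape[0])))
    return joint - marginal

T = StatisticsNetwork(5, 5)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for _ in range(200):                                      # maximize the bound by gradient ascent
    x = torch.randn(256, 5)
    y = x + 0.5 * torch.randn(256, 5)                     # correlated toy pair
    loss = -dv_lower_bound(T, x, y)
    opt.zero_grad(); loss.backward(); opt.step()
```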
Generative Adversarial Networks
We propose a new framework for estimating generative models via an
adversarial process, in which we simultaneously train two models: a generative
model G that captures the data distribution, and a discriminative model D that
estimates the probability that a sample came from the training data rather than
G. The training procedure for G is to maximize the probability of D making a
mistake. This framework corresponds to a minimax two-player game. In the space
of arbitrary functions G and D, a unique solution exists, with G recovering the
training data distribution and D equal to 1/2 everywhere. In the case where G
and D are defined by multilayer perceptrons, the entire system can be trained
with backpropagation. There is no need for any Markov chains or unrolled
approximate inference networks during either training or generation of samples.
Experiments demonstrate the potential of the framework through qualitative and
quantitative evaluation of the generated samples.
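The two-player game can be summarized in a short training-loop sketch on toy one-dimensional data; the architectures, hyperparameters, and the non-saturating form of the generator loss are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn as nn

# Minimal GAN training loop on toy 1-D data; all sizes and data are placeholders.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # generator: noise -> sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator: sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0            # toy "data distribution"
    fake = G(torch.randn(64, 8))

    # Discriminator step: distinguish real (label 1) from generated (label 0) samples.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: maximize the probability of D making a mistake
    # (non-saturating variant of the minimax objective).
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```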
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech--two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a diverse variety of speech
including noisy environments, accents and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.